ABSTRACT: This paper proposes a human tracking and recognition method for a camera network. Human matching in a
multi-camera surveillance system is a fundamental issue for increasing the accuracy of recognition across multiple camera
views. In a camera network, videos differ in characteristics such as pose, scale and illumination. It is therefore
necessary to use a hybrid scheme based on the scale invariant feature transform to detect and recognize human behaviors. The
main focus of this paper is to analyze activities for tracking and recognizing humans in order to extract trajectories.
Extracting the trajectories helps to detect abnormal behavior that may be occluded in single-camera surveillance.
KEYWORDS: Camera network, Multi-camera surveillance, Human behavior, Trajectory extraction.



I. INTRODUCTION

Tracking and behavior recognition are two fundamental tasks in video surveillance systems, which are widely
employed in commercial applications for statistics gathering and processing. The number of cameras and the
complexity of surveillance systems have been continuously increasing to provide better coverage and accuracy. Multi-camera
systems are becoming increasingly attractive in machine vision. Applications include multi-view object tracking, event
detection, and occlusion handling. In this paper, we develop a method for tracking and recognition using a traffic video
surveillance system of two cameras with partially overlapping fields of view.

This paper is organized as follows: Section II gives an overview of past work. Our proposed architecture and
algorithm are presented in Section III. Results of subjective evaluations and objective performance measurements with
respect to ground truth are presented in Section IV. Section V contains the conclusion.

II. PAST WORKS ON MULTI-CAMERA SURVEILLANCE

In the last few years, a lot of work on detecting, describing and matching feature points has been carried out. In a
camera network, feature matching between multiple images of a scene is an important component of many computer vision tasks.
Although the correspondences can be hand selected, such a procedure is hardly conceivable as the number of cameras
increases or when the camera configuration changes frequently, as in a network of pan-tilt-zoom cameras [1]. Other methods
for finding correspondences across cameras [2] have been developed through a feature detection method such as the Harris
corner detection method [3] or the scale invariant feature transform [4]. It was shown in [5] that corners are efficient for
tracking and estimating structure from motion. A corner detector is robust to changes in rotation and intensity but is very
sensitive to changes in scale. The Harris detector finds points where the local autocorrelation function has high curvature
in the directions of both maximal and minimal curvature, as given by the eigenvalues of the local second-moment matrix. The
authors of [3] develop an efficient method for determining the relative magnitude of the eigenvalues without explicitly
computing them. Color-based matching methods have also been used to track moving objects across cameras [6, 7]. Scale
invariant feature matching was first proposed in [8] and attracted the attention of the computer vision community for its
invariance to scale, rotation, and view-point variations. That work uses a scale-invariant detector in the difference of
Gaussian (DoG) scale space. The method in [4] fits a quadratic to the local scale-space neighborhood to improve localization
accuracy. It then creates a Scale Invariant Feature Transform (SIFT) descriptor and matches key-points using a Euclidean
distance metric in an efficient best-bin-first algorithm, where a match is rejected if the ratio of the best to the
second-best match distance is greater than a threshold.

A comparative study of many local image descriptors [9] shows the superiority of this method over other
feature descriptors under several local transformations. The authors of [10] develop a scale-invariant Harris detector that
keeps a key-point at each scale only if it is a maximum in the Laplacian scale space [11]. More recently, [12] integrates
edge-based features with local feature-based recognition using a structure similar to shape contexts [13] for general
object-class recognition. The authors of [14] propose a matching technique based on the Harris corner detector and a
description based on the Fourier transform to achieve invariance to rotation. Harris corners are also used in [15], where
rotation invariance is obtained by a hierarchical sampling that starts from the direction of the gradient. The authors of
[16] introduce the concept of maximally stable extremal regions for robust matching. These regions are connected components
of pixels that are all brighter or all darker than the pixels on the region's contour; they are invariant to affine and
perspective transforms and to monotonic transformations of image intensities. Among the many recent works on key-point
detection, it is worth mentioning the scale and affine invariant interest points proposed in [17], as they appear to be
among the most promising key-point detectors to date. The detection algorithm can be sketched as follows: first, Harris
corners are detected at multiple scales; then, points at which a local measure of variation is maximal over scale are
selected. This provides a set of distinctive points at the appropriate scale. Finally, an iterative algorithm modifies the
location, scale, and neighborhood of each point and converges to affine invariant points. The authors of [18] describe a
matching procedure in which motion trajectories of objects tracked in different cameras are matched so that the overall
ground plane can be aligned across cameras by a homography transformation [19-21].

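As an illustration of how the Harris detector can compare the two eigenvalues without computing them, the sketch below
uses the widely cited response R = det(M) - k * trace(M)^2, where M is the second-moment matrix summed over a small window.
The function name, the unweighted 3x3 window, the central-difference gradients and the value k = 0.04 are assumptions made
for this example; they are not taken from this paper.

```python
def harris_response(img, y, x, k=0.04, radius=1):
    """Harris corner response at pixel (y, x) of a 2-D list of floats.

    Assumes (y, x) lies at least radius + 1 pixels away from every border.
    """
    sxx = syy = sxy = 0.0
    for v in range(y - radius, y + radius + 1):
        for u in range(x - radius, x + radius + 1):
            # Central-difference image gradients.
            ix = (img[v][u + 1] - img[v][u - 1]) / 2.0
            iy = (img[v + 1][u] - img[v - 1][u]) / 2.0
            # Accumulate the entries of the 2x2 second-moment matrix.
            sxx += ix * ix
            syy += iy * iy
            sxy += ix * iy
    det = sxx * syy - sxy * sxy   # equals the product of the eigenvalues
    trace = sxx + syy             # equals the sum of the eigenvalues
    return det - k * trace * trace


# Synthetic 12x12 image: a bright square filling the lower-right quadrant.
image = [[1.0 if (r >= 6 and c >= 6) else 0.0 for c in range(12)]
         for r in range(12)]

corner = harris_response(image, 6, 6)  # at the corner of the square
edge = harris_response(image, 9, 6)    # on a straight vertical edge
flat = harris_response(image, 2, 2)    # in a uniform region
```

In this toy image the response is positive at the corner, negative on the straight edge and zero in the flat region,
which is the behavior a corner detector relies on.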
III. PROPOSED ARCHITECTURE

First, we review the function of typical single-camera and multi-camera surveillance systems, as presented in our
previous work [22]. The function of a typical single-camera surveillance system is illustrated in Fig. 1. The first part of
the processing flowchart, marked "Detecting & Matching Features Extraction Pipeline", is very general. This pipeline
may produce all target information (pose, scale, illumination, color, shape, etc.) and, potentially, a description of the
scene. At the end of the processing pipeline, human tracking and classification are performed.
Fig. 1: Single-camera processing



Only the matching features have to be stored, rather than high-quality video suitable for automated processing. This
makes a multi-camera surveillance system practical. A single-camera video surveillance system, as described above, cannot
provide an adequate solution for many applications [23-27]. A multi-camera surveillance system that tracks targets from one
camera to the next can overcome these limitations. A typical multi-camera surveillance network is illustrated in Fig. 2.
Fusing at the matching-features level requires merging the features from all cameras into a full representation of the
environment. This approach distributes the most time-consuming processing among the different cameras and minimizes
communication, since only the extracted features need to be transmitted, not video or images. Given these advantages, the
system communicates only the matching features for fusion.
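To give a rough sense of the communication saving, the following back-of-the-envelope sketch compares the size of one
uncompressed frame with the size of a feature message. All numbers (frame size, 200 key-points, 128 one-byte descriptor
bins, 16 bytes of key-point metadata) and both function names are hypothetical and are not measurements from this system;
compressed video would narrow the gap.

```python
def raw_frame_bytes(width, height, bytes_per_pixel=1):
    """Size in bytes of one uncompressed frame."""
    return width * height * bytes_per_pixel


def feature_payload_bytes(num_keypoints, descriptor_dim=128,
                          bytes_per_bin=1, meta_bytes=16):
    """Size in bytes of the feature message for one frame.

    meta_bytes covers x, y, scale and orientation stored as four
    32-bit floats (a hypothetical message layout).
    """
    return num_keypoints * (descriptor_dim * bytes_per_bin + meta_bytes)


# Hypothetical numbers: one 640x480 grayscale frame and 200 key-points.
frame = raw_frame_bytes(640, 480)
features = feature_payload_bytes(200)
savings = frame / features
print(frame, features, round(savings, 1))  # prints: 307200 28800 10.7
```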
Fig. 2: Multi-camera network processing



The problem of multi-view activity recognition has been addressed in many papers, but in almost all of them the
information from multiple views is fused centrally. Our proposed framework is decentralized. The placement of the cameras
at an intersection is shown in Fig. 3.




Fig. 3: Camera setup in a network



The structure of the camera network is illustrated in Fig. 4. Each camera has processing cores at four levels. The
input stream is fed to the detection level. At the decision level, control commands are issued to classify the detected
human based on the extracted description features. Processing cores in the three upper levels exchange the requisite
information to track and recognize more accurately.




Fig. 4: Structure of the camera network
The Scale Invariant Feature Transform has been shown to perform better than other local descriptors [9]. Given a
feature point, the descriptor computes the gradient vector for each pixel in the feature point's neighborhood and builds a
normalized histogram of gradient directions. The neighborhood is partitioned into sub-regions of 4x4 pixels each. For each
pixel within a sub-region, the descriptor adds the pixel's gradient vector to a histogram of gradient directions
by quantizing each orientation to one of 8 directions and weighting the contribution of each vector by its magnitude.
The principal features of our scheme are (i) communication efficiency: the camera network is particularly well suited
to low bandwidth; and (ii) unsupervised operation: the method does not require pre-calibration of the scene and, hence, can
be used in traffic scenes where the system administrator may not have control over the activities taking place. Fig. 5 shows
the matching results using the descriptor created for a corresponding pair of points.
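The descriptor construction described above can be sketched as follows. This is a simplified illustration, not the
implementation used in this work: it assumes a 16x16 patch (a 4x4 grid of 4x4-pixel sub-regions, giving 16 x 8 = 128
values), uses clamped central differences, and omits the Gaussian weighting, vote interpolation and rotation to the
key-point orientation found in full SIFT implementations. The function name is hypothetical.

```python
import math


def sift_like_descriptor(patch):
    """Build a 128-D gradient-orientation descriptor from a 16x16 patch.

    patch is a 16x16 list of lists of floats. It is split into a 4x4 grid
    of sub-regions of 4x4 pixels, each with an 8-bin orientation histogram.
    """
    size, cell, bins = 16, 4, 8
    cells_per_row = size // cell
    hist = [0.0] * (cells_per_row * cells_per_row * bins)
    for y in range(size):
        for x in range(size):
            # Central differences, clamped at the patch border.
            dx = patch[y][min(x + 1, size - 1)] - patch[y][max(x - 1, 0)]
            dy = patch[min(y + 1, size - 1)][x] - patch[max(y - 1, 0)][x]
            magnitude = math.hypot(dx, dy)
            angle = math.atan2(dy, dx) % (2.0 * math.pi)
            # Quantize the orientation to one of 8 directions.
            b = int(angle / (2.0 * math.pi / bins)) % bins
            region = (y // cell) * cells_per_row + (x // cell)
            # Weight the vote by the gradient magnitude.
            hist[region * bins + b] += magnitude
    # Normalize to unit length; a flat patch yields an all-zero vector.
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist


# Example: a horizontal intensity ramp has gradients pointing along +x only.
ramp = [[float(x) for x in range(16)] for _ in range(16)]
desc = sift_like_descriptor(ramp)
```

For the ramp example, all of the energy falls into the first direction bin of each sub-region, and the vector has unit
length.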




Fig. 5: Matching results using the descriptor.
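Descriptor matching between two views is commonly done with a nearest-neighbor search followed by a ratio test, in which
a match is rejected when the best distance is not clearly smaller than the second-best distance. The sketch below is a
minimal version under stated assumptions: it uses exhaustive search rather than a best-bin-first search, the threshold of
0.8 is an assumed value, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def match_ratio_test(desc_a, desc_b, ratio=0.8):
    """Match each descriptor in desc_a against desc_b with the ratio test.

    Returns a list of (index_in_a, index_in_b) pairs. desc_b must hold
    at least two descriptors.
    """
    matches = []
    for i, a in enumerate(desc_a):
        # Distances from a to every candidate, sorted in ascending order.
        dists = sorted((euclidean(a, b), j) for j, b in enumerate(desc_b))
        (best, best_j), (second, _) = dists[0], dists[1]
        # Keep the match only if the best is clearly better than the runner-up.
        if best < ratio * second:
            matches.append((i, best_j))
    return matches


# Toy 2-D "descriptors": the third query is equally close to two candidates.
camera_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
camera_b = [[1.0, 0.1], [0.0, 0.9], [5.0, 5.0]]
matches = match_ratio_test(camera_a, camera_b)
```

In the toy data the first two descriptors are matched, while the third is rejected as ambiguous because its two nearest
candidates are about equally distant.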



IV. EXPERIMENTAL RESULTS 

We have experimented with various feature detectors, including the Harris corner detector (HCD), the curvilinear
structure detector (CSD), and the difference of Gaussian (DoG) scale space. Fig. 6 shows the experimental comparison of
these methods. The results indicate that using SIFT point descriptors in a camera network improves performance relative to
the other calibration systems and to the other existing approaches considered. As explained, the description is computed as
follows: once a key-point is located and its scale has been estimated, one or more orientations are assigned to it based on
the local image gradient direction around the key-point. Then, the image gradient magnitude and orientation are sampled
around the key-point, using the scale of the key-point to select the level of Gaussian blur. The gradient orientations
obtained are rotated with respect to the key-point orientation previously computed. Finally, the area around the key-point
is divided into sub-regions, each of which is associated with an orientation histogram weighted by the gradient magnitude.
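The orientation-assignment step can be illustrated with a small sketch. It builds a magnitude-weighted histogram of
gradient directions over a patch and returns every local peak that is close to the highest one, which is how a key-point
can receive more than one orientation. The 36 bins and the 80% peak ratio are values commonly quoted for SIFT and are
assumptions here, as is the function name; Gaussian weighting and peak interpolation are omitted.

```python
import math


def dominant_orientations(patch, num_bins=36, peak_ratio=0.8):
    """Return key-point orientations in degrees from a square gray patch.

    Angles follow image coordinates (y grows downward).
    """
    n = len(patch)
    bin_width = 360.0 / num_bins
    hist = [0.0] * num_bins
    for y in range(1, n - 1):
        for x in range(1, n - 1):
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            angle = math.degrees(math.atan2(dy, dx)) % 360.0
            # Vote into the nearest bin, weighted by gradient magnitude.
            b = int(round(angle / bin_width)) % num_bins
            hist[b] += math.hypot(dx, dy)
    peak = max(hist)
    if peak == 0.0:
        return []  # flat patch: no orientation can be assigned
    result = []
    for b, value in enumerate(hist):
        left, right = hist[b - 1], hist[(b + 1) % num_bins]
        # Keep local maxima that reach peak_ratio of the global peak.
        if value >= peak_ratio * peak and value > left and value > right:
            result.append(b * bin_width)
    return result


# Example: intensity increasing with y gives a single orientation of 90 degrees.
vertical_ramp = [[float(y) for _ in range(8)] for y in range(8)]
orientations = dominant_orientations(vertical_ramp)
```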



Table 1: Number of matches by feature descriptor.
Table 2 presents the counting and classification results. As shown, the overall accuracy with the DoG detector is about
90% for counting cars and about 94% for buses and trucks. This system can serve as an input to a calibration system in a
multi-camera surveillance system.



Table 2: Counting and classification results



Number of object matching by algorithm 
V. CONCLUSION

In this paper we considered the problem of feature matching in a camera network with overlapping fields of view.
We showed that using SIFT point descriptors in a camera network can improve performance relative to the other
calibration systems. In particular, it returned good results for scale changes, zoom, image plane rotations, and large
view-point variations. These conclusions are supported by an extensive experimental evaluation on different scenes.
Therefore, tracking and recognition using SIFT becomes feasible. This should result in highly robust trackers.